Association Rules Apriori Algorithm

The Apriori algorithm is an unsupervised rule-learning algorithm that looks for patterns in the relationships among items in a dataset. The analysis is based on the idea of market basket analysis, which looks at patterns of co-occurrence. Apriori is referred to as a “smart” rule learner because, instead of evaluating every possible combination of items, which is computationally expensive, it takes advantage of the fact that some combinations rarely occur and ignores them: any itemset that contains an infrequent itemset must itself be infrequent, so it never needs to be counted. Its strengths are that it can handle large datasets and that it produces rules that are easy for a human to understand. Its weaknesses are that it is less effective on smaller datasets and that it can easily draw false conclusions from random patterns.
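
The pruning idea can be illustrated with a minimal base R sketch. The transactions, the `min_support` value and the `support()` helper are hypothetical and chosen only for illustration; this is not how the arules package implements the search internally.

```r
# Toy transactions (hypothetical data, not from the retail dataset)
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("milk", "tea"),
  c("bread", "milk", "butter")
)
min_support <- 0.5

# Fraction of transactions that contain every item in `itemset`
support <- function(itemset) {
  mean(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
}

# Level 1: keep only the single items that meet the support threshold
items <- sort(unique(unlist(transactions)))
item_support <- vapply(items, support, numeric(1))
frequent_items <- items[item_support >= min_support]

# Level 2: build candidate pairs only from the frequent single items,
# so no pair containing an infrequent item (here "tea") is ever counted
candidates <- combn(frequent_items, 2, simplify = FALSE)
pair_support <- vapply(candidates, support, numeric(1))
frequent_pairs <- candidates[pair_support >= min_support]

frequent_items
frequent_pairs
```

With four distinct items there are six possible pairs, but only three are counted because "tea" is dropped after the first pass. On a catalogue of thousands of items the same idea removes the vast majority of candidates.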

Step 1 - Collect Data

The data is of all transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based online retail company. The source is Dr Daqing Chen at the School of Engineering, London South Bank University.

importing the data

library(readxl)  # provides read_excel()
data = read_excel('Online Retail.xlsx')

Step 2 - Exploring And Preparing The Data

The dataset has 541,909 observations, where each observation represents one line item of a transaction (an invoice), and 8 features describing that line item.

preparing data

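# packages used below; plyr is loaded before dplyr so dplyr's verbs are not masked
library(plyr)   # ddply()
library(dplyr)  # %>% and mutate()

# only keep rows that are complete
data = data[complete.cases(data), ]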

# converting variables to appropriate type
# note: mutate() returns a modified copy; the results here are not assigned
# back to `data`, so these columns are still character in str(data) below
data %>% mutate(Description = as.factor(Description))
data %>% mutate(Country = as.factor(Country))
data %>% mutate(StockCode = as.factor(StockCode))
# Store the date part of InvoiceDate (a date-time) in a new Date variable
data$Date = as.Date(data$InvoiceDate)

# Extract time from InvoiceDate and store it in a separate vector
TransTime = format(data$InvoiceDate,"%H:%M:%S")
# Convert InvoiceNo to numeric; non-numeric invoice numbers become NA
InvoiceNo = as.numeric(as.character(data$InvoiceNo))
## Warning: NAs introduced by coercion
# cbind() returns a new data frame; the results are not assigned, so `data`
# keeps its original InvoiceNo column and gains no TransTime column
cbind(data,TransTime)
cbind(data,InvoiceNo)
str(data)
## tibble [406,829 × 9] (S3: tbl_df/tbl/data.frame)
##  $ InvoiceNo  : chr [1:406829] "536365" "536365" "536365" "536365" ...
##  $ StockCode  : chr [1:406829] "85123A" "71053" "84406B" "84029G" ...
##  $ Description: chr [1:406829] "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
##  $ Quantity   : num [1:406829] 6 6 8 6 6 2 6 6 6 32 ...
##  $ InvoiceDate: POSIXct[1:406829], format: "2010-12-01 08:26:00" "2010-12-01 08:26:00" ...
##  $ UnitPrice  : num [1:406829] 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
##  $ CustomerID : num [1:406829] 17850 17850 17850 17850 17850 ...
##  $ Country    : chr [1:406829] "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
##  $ Date       : Date[1:406829], format: "2010-12-01" "2010-12-01" ...

storing transaction data into new dataframe

# ddply(dataframe, variables_to_split_by, function_to_apply_to_each_piece)
# collapse the item descriptions of each invoice into one comma-separated string
transactionData = ddply(data,c("InvoiceNo","Date"),
                       function(df1)paste(df1$Description,
                       collapse = ","))
# drop the InvoiceNo column
transactionData$InvoiceNo = NULL
# drop the Date column
transactionData$Date = NULL
# rename the remaining column to items
colnames(transactionData) = c("items")
# show transactionData
transactionData
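
To make the reshaping step concrete, here is a small base R sketch of the same idea on made-up data. The `line_items` data frame and its values are hypothetical, and `aggregate()` stands in for `ddply()` only for illustration.

```r
# Hypothetical line-item data: one row per product on an invoice
line_items <- data.frame(
  InvoiceNo   = c("1001", "1001", "1002", "1003", "1003", "1003"),
  Description = c("TEA", "SUGAR", "COFFEE", "TEA", "COFFEE", "MUG"),
  stringsAsFactors = FALSE
)

# One comma-separated string of items per invoice
baskets <- aggregate(Description ~ InvoiceNo, data = line_items,
                     FUN = paste, collapse = ",")
names(baskets)[2] <- "items"

baskets
```

Six line items become three baskets, one per invoice, which is the shape the basket file format expects.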

write transactionData to new csv file

write.csv(transactionData,"market_basket_transactions.csv", quote = FALSE, row.names = FALSE)
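
One thing worth checking when writing baskets with `quote = FALSE` is what happens to a product description that itself contains a comma. The sketch below uses a made-up basket and a temporary file; the product names are hypothetical.

```r
# Hypothetical basket of two products, one of which has a comma in its name
one_basket <- data.frame(
  items = paste(c("CANDLE, LARGE", "MUG"), collapse = ","),
  stringsAsFactors = FALSE
)

path <- tempfile(fileext = ".csv")
write.csv(one_basket, path, quote = FALSE, row.names = FALSE)

# Read the raw lines back and split the basket line on commas
lines  <- readLines(path)
parsed <- trimws(strsplit(lines[2], ",")[[1]])

lines[1]   # the column name, written as the first line of the file
parsed     # three fields, although only two products were bought
```

Any reader that splits each line on commas will treat the two halves of such a description as separate items.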

read from csv into object of transaction class

library(arules)  # read.transactions(), apriori(), inspect()
tr = suppressWarnings(read.transactions('market_basket_transactions.csv', format = 'basket', sep=','))
tr
## transactions in sparse format with
##  22191 transactions (rows) and
##  7876 items (columns)
summary(tr)
## transactions as itemMatrix in sparse format with
##  22191 rows (elements/itemsets/transactions) and
##  7876 columns (items) and a density of 0.001930725 
## 
## most frequent items:
## WHITE HANGING HEART T-LIGHT HOLDER           REGENCY CAKESTAND 3 TIER 
##                               1803                               1709 
##            JUMBO BAG RED RETROSPOT                      PARTY BUNTING 
##                               1460                               1285 
##      ASSORTED COLOUR BIRD ORNAMENT                            (Other) 
##                               1250                             329938 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 3598 1594 1141  908  861  758  696  676  663  593  624  537  516  531  551  522 
##   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
##  464  441  483  419  395  315  306  272  238  253  229  213  222  215  170  159 
##   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48 
##  138  142  134  109  111   90  113   94   93   87   88   65   63   67   63   60 
##   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64 
##   59   49   64   40   41   49   43   36   29   39   30   27   28   17   25   25 
##   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79   80 
##   20   27   24   22   15   20   19   13   16   16   11   15   12    7    9   14 
##   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95   96 
##   15   12    8    9   11   11   14    8    6    5    6   11    6    4    4    3 
##   97   98   99  100  101  102  103  104  105  106  107  108  109  110  111  112 
##    6    5    2    4    2    4    4    3    2    2    6    3    4    3    2    1 
##  113  114  116  117  118  120  121  122  123  125  126  127  131  132  133  134 
##    3    1    3    3    3    1    2    2    1    3    2    2    1    1    2    1 
##  140  141  142  143  145  146  147  150  154  157  168  171  177  178  180  202 
##    1    2    2    1    1    2    1    1    3    2    2    2    1    1    1    1 
##  204  228  236  249  250  285  320  400  419 
##    1    1    1    1    1    1    1    1    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00   10.00   15.21   21.00  419.00 
## 
## includes extended item information - examples:
##                       labels
## 1                   1 HANGER
## 2     10 COLOUR SPACEBOY PEN
## 3 12 COLOURED PARTY BALLOONS
# examine the frequency of items
itemFrequency(tr[,1:3])
##                   1 HANGER     10 COLOUR SPACEBOY PEN 
##                0.002343292                0.008832410 
## 12 COLOURED PARTY BALLOONS 
##                0.005317471
# item frequency plot for top 20 items
itemFrequencyPlot(tr, type="absolute", topN = 20, main="Absolute Item Frequency Plot")

itemFrequencyPlot(tr, topN = 20, main="Relative Item Frequency Plot")

# visualization of a random sample of 1000 transactions
image(sample(tr, 1000))

Step 3 - Training A Model On The Data

To train the model, which in this case means generating the association rules, we use the apriori function from the arules R package. We pass the transactions object along with values for the support, confidence and maxlen parameters. These parameters tune the output so that we strike a balance between generating too many rules and generating none, or only generic ones.
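
As a reminder of what the support and confidence thresholds measure, the following base R sketch computes support, confidence and lift for a single rule on toy data. The transactions and the `support()` helper are hypothetical and only illustrate the standard definitions.

```r
# Toy transactions (hypothetical data)
transactions <- list(
  c("tea", "sugar", "coffee"),
  c("tea", "sugar"),
  c("coffee", "mug"),
  c("tea", "coffee"),
  c("mug")
)

# Fraction of transactions containing all items in `itemset`
support <- function(itemset) {
  mean(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
}

# Rule under study: {tea} => {sugar}
lhs <- "tea"
rhs <- "sugar"

rule_support    <- support(c(lhs, rhs))            # both sides appear together
rule_confidence <- rule_support / support(lhs)     # share of lhs baskets that also hold rhs
rule_lift       <- rule_confidence / support(rhs)  # confidence relative to how common rhs is

c(support = rule_support, confidence = rule_confidence, lift = rule_lift)
```

A lift above 1 means the right-hand side appears more often alongside the left-hand side than it does across all transactions.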

generating rules

association.rules = apriori(tr, parameter = list(supp=0.001, conf=0.8,maxlen=10))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 22 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[7876 item(s), 22191 transaction(s)] done [0.39s].
## sorting and recoding items ... [2324 item(s)] done [0.01s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10
## Warning in apriori(tr, parameter = list(supp = 0.001, conf = 0.8, maxlen =
## 10)): Mining stopped (maxlen reached). Only patterns up to a length of 10
## returned!
##  done [0.28s].
## writing ... [49122 rule(s)] done [0.03s].
## creating S4 object  ... done [0.02s].

Step 4 - Evaluating Model Performance

looking at the first 10 rules

inspect(association.rules[1:10])
##      lhs                            rhs                                support confidence    coverage     lift count
## [1]  {WOBBLY CHICKEN}            => {DECORATION}                   0.001261773  1.0000000 0.001261773 443.8200    28
## [2]  {WOBBLY CHICKEN}            => {METAL}                        0.001261773  1.0000000 0.001261773 443.8200    28
## [3]  {DECOUPAGE}                 => {GREETING CARD}                0.001036456  1.0000000 0.001036456 389.3158    23
## [4]  {BILLBOARD FONTS DESIGN}    => {WRAP}                         0.001306836  1.0000000 0.001306836 715.8387    29
## [5]  {WRAP}                      => {BILLBOARD FONTS DESIGN}       0.001306836  0.9354839 0.001396963 715.8387    29
## [6]  {ENAMEL PINK TEA CONTAINER} => {ENAMEL PINK COFFEE CONTAINER} 0.001396963  0.8157895 0.001712406 385.1741    31
## [7]  {WOBBLY RABBIT}             => {DECORATION}                   0.001532153  1.0000000 0.001532153 443.8200    34
## [8]  {WOBBLY RABBIT}             => {METAL}                        0.001532153  1.0000000 0.001532153 443.8200    34
## [9]  {ART LIGHTS}                => {FUNK MONKEY}                  0.001712406  1.0000000 0.001712406 583.9737    38
## [10] {FUNK MONKEY}               => {ART LIGHTS}                   0.001712406  1.0000000 0.001712406 583.9737    38
summary(association.rules)
## set of 49122 rules
## 
## rule length distribution (lhs + rhs):sizes
##     2     3     4     5     6     7     8     9    10 
##   105  2111  6854 16424 14855  6102  1937   613   121 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   5.000   5.000   5.499   6.000  10.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift        
##  Min.   :0.001036   Min.   :0.8000   Min.   :0.001036   Min.   :  9.846  
##  1st Qu.:0.001082   1st Qu.:0.8333   1st Qu.:0.001262   1st Qu.: 22.237  
##  Median :0.001262   Median :0.8788   Median :0.001442   Median : 28.760  
##  Mean   :0.001417   Mean   :0.8849   Mean   :0.001609   Mean   : 64.589  
##  3rd Qu.:0.001532   3rd Qu.:0.9259   3rd Qu.:0.001712   3rd Qu.: 69.200  
##  Max.   :0.015997   Max.   :1.0000   Max.   :0.019107   Max.   :715.839  
##      count       
##  Min.   : 23.00  
##  1st Qu.: 24.00  
##  Median : 28.00  
##  Mean   : 31.45  
##  3rd Qu.: 34.00  
##  Max.   :355.00  
## 
## mining info:
##  data ntransactions support confidence
##    tr         22191   0.001        0.8
##                                                                         call
##  apriori(data = tr, parameter = list(supp = 0.001, conf = 0.8, maxlen = 10))

Looking at the quality measure summary, we can see that many rules sit above our minimum confidence and support thresholds. If most of the rules were clustered right at the values we set, that would indicate that we set them too aggressively. Instead, the 3rd quartile for confidence is about 0.93, which means plenty of rules have confidence well above 0.8, and the mean support is 0.0014, above our 0.001 threshold.
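
The numbers in the summary can be cross-checked with a little arithmetic. The sketch below only uses values printed in the output above; the variable names are ours.

```r
n_transactions <- 22191   # from the summary(tr) output
min_support    <- 0.001   # supp passed to apriori()

# Smallest whole number of transactions that satisfies the support threshold
min_count <- ceiling(min_support * n_transactions)
min_count                      # 23, the minimum count in the summary
min_count / n_transactions     # about 0.001036, the minimum support in the summary

# lift = confidence / support(rhs), so rule [1] ({WOBBLY CHICKEN} => {DECORATION},
# confidence 1, lift 443.82) implies this support for the right-hand side
rhs_support <- 1 / 443.82
rhs_support * n_transactions   # about 50 transactions contain DECORATION
```

The apriori log reports an absolute minimum support count of 22, which is consistent with 0.001 × 22191 = 22.191 being truncated; a rule still needs at least 23 transactions to reach a support of 0.001. The very high lift values therefore come from items that are rare overall, so they should be read with the small counts in mind.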

Looking at the first 10 rules, though, we notice that while the confidence levels are very high, the rules themselves are not very useful: they are fairly obvious and not very meaningful. What is noteworthy is that all of these generic rules have a single item on the left-hand side, for a total rule length of 2.

Step 5 - Improving Model Performance

We can improve the usefulness of our rules by making the results more actionable. This means sorting and filtering the rules to find the more interesting or meaningful ones. We also have several parameters that can be used to tune the rules generated: support, confidence and minlen. The support parameter sets the minimum required rule support, confidence sets the minimum required rule confidence, and minlen sets the minimum number of items a rule must contain (left-hand and right-hand sides combined).
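
Sorting and filtering can be sketched in base R on a small data frame. The rows below are copied from the first-10-rules output above, while the data frame itself and the 0.95 cut-off are our own illustration. For a rules object, arules provides `sort()` and `subset()` methods that serve the same purpose.

```r
# Four of the rules inspected above, copied into a plain data frame
rules_df <- data.frame(
  lhs = c("WRAP", "ENAMEL PINK TEA CONTAINER", "ART LIGHTS", "DECOUPAGE"),
  rhs = c("BILLBOARD FONTS DESIGN", "ENAMEL PINK COFFEE CONTAINER",
          "FUNK MONKEY", "GREETING CARD"),
  support    = c(0.001306836, 0.001396963, 0.001712406, 0.001036456),
  confidence = c(0.9354839, 0.8157895, 1.0000000, 1.0000000),
  lift       = c(715.8387, 385.1741, 583.9737, 389.3158),
  stringsAsFactors = FALSE
)

# Sort: strongest association (highest lift) first
by_lift <- rules_df[order(rules_df$lift, decreasing = TRUE), ]

# Filter: keep only rules with confidence of at least 0.95 (illustrative cut-off)
strong <- rules_df[rules_df$confidence >= 0.95, ]

by_lift[, c("lhs", "rhs", "lift")]
strong[, c("lhs", "rhs", "confidence")]
```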

generating rules with a minlen parameter to eliminate obvious rules.

association.rules = apriori(tr, parameter = list(supp=0.001, conf=0.8,maxlen=10,minlen=3))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      3
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 22 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[7876 item(s), 22191 transaction(s)] done [0.18s].
## sorting and recoding items ... [2324 item(s)] done [0.01s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10
## Warning in apriori(tr, parameter = list(supp = 0.001, conf = 0.8, maxlen = 10,
## : Mining stopped (maxlen reached). Only patterns up to a length of 10 returned!
##  done [0.27s].
## writing ... [49017 rule(s)] done [0.02s].
## creating S4 object  ... done [0.02s].

evaluating performance

inspect(association.rules[1:10])
##      lhs                                 rhs                                 support confidence    coverage      lift count
## [1]  {DECORATION,                                                                                                          
##       WOBBLY CHICKEN}                 => {METAL}                         0.001261773  1.0000000 0.001261773 443.82000    28
## [2]  {METAL,                                                                                                               
##       WOBBLY CHICKEN}                 => {DECORATION}                    0.001261773  1.0000000 0.001261773 443.82000    28
## [3]  {DECORATION,                                                                                                          
##       WOBBLY RABBIT}                  => {METAL}                         0.001532153  1.0000000 0.001532153 443.82000    34
## [4]  {METAL,                                                                                                               
##       WOBBLY RABBIT}                  => {DECORATION}                    0.001532153  1.0000000 0.001532153 443.82000    34
## [5]  {BLACK TEA,                                                                                                           
##       SUGAR JARS}                     => {COFFEE}                        0.002072912  1.0000000 0.002072912  69.34687    46
## [6]  {BLACK TEA,                                                                                                           
##       COFFEE}                         => {SUGAR JARS}                    0.002072912  1.0000000 0.002072912 238.61290    46
## [7]  {FRENCH BLUE METAL DOOR SIGN 0,                                                                                       
##       FRENCH BLUE METAL DOOR SIGN 9}  => {FRENCH BLUE METAL DOOR SIGN 7} 0.001577216  0.9722222 0.001622279 303.86737    35
## [8]  {FRENCH BLUE METAL DOOR SIGN 7,                                                                                       
##       FRENCH BLUE METAL DOOR SIGN 9}  => {FRENCH BLUE METAL DOOR SIGN 0} 0.001577216  0.8139535 0.001937723 291.32971    35
## [9]  {FRENCH BLUE METAL DOOR SIGN 0,                                                                                       
##       FRENCH BLUE METAL DOOR SIGN 7}  => {FRENCH BLUE METAL DOOR SIGN 9} 0.001577216  0.8750000 0.001802533 340.65132    35
## [10] {FRENCH BLUE METAL DOOR SIGN 0,                                                                                       
##       FRENCH BLUE METAL DOOR SIGN 9}  => {FRENCH BLUE METAL DOOR SIGN 8} 0.001396963  0.8611111 0.001622279 272.98452    31
summary(association.rules)
## set of 49017 rules
## 
## rule length distribution (lhs + rhs):sizes
##     3     4     5     6     7     8     9    10 
##  2111  6854 16424 14855  6102  1937   613   121 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   5.000   5.507   6.000  10.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift        
##  Min.   :0.001036   Min.   :0.8000   Min.   :0.001036   Min.   :  9.846  
##  1st Qu.:0.001082   1st Qu.:0.8333   1st Qu.:0.001262   1st Qu.: 22.236  
##  Median :0.001262   Median :0.8788   Median :0.001397   Median : 28.760  
##  Mean   :0.001411   Mean   :0.8848   Mean   :0.001601   Mean   : 64.262  
##  3rd Qu.:0.001532   3rd Qu.:0.9259   3rd Qu.:0.001712   3rd Qu.: 68.703  
##  Max.   :0.015997   Max.   :1.0000   Max.   :0.019107   Max.   :461.060  
##      count       
##  Min.   : 23.00  
##  1st Qu.: 24.00  
##  Median : 28.00  
##  Mean   : 31.31  
##  3rd Qu.: 34.00  
##  Max.   :355.00  
## 
## mining info:
##  data ntransactions support confidence
##    tr         22191   0.001        0.8
##                                                                                     call
##  apriori(data = tr, parameter = list(supp = 0.001, conf = 0.8, maxlen = 10, minlen = 3))

Once again the quality measure summary looks good, and this time the first 10 rules are more meaningful. We can now see that transactions containing black tea and sugar jars also contained coffee 100% of the time, and that transactions containing French blue metal door signs 0 and 9 also contained sign 8 86% of the time.
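
The 86% figure can be recovered from the other columns of the output, since confidence is the rule's support divided by its coverage (the support of the left-hand side). The sketch below uses only the values printed for rule [10]; the variable names are ours.

```r
n_transactions <- 22191

# Rule [10]: {DOOR SIGN 0, DOOR SIGN 9} => {DOOR SIGN 8}
support_10  <- 0.001396963   # both sides together
coverage_10 <- 0.001622279   # left-hand side alone

confidence_10 <- support_10 / coverage_10
confidence_10                          # about 0.861, the reported confidence

round(coverage_10 * n_transactions)    # 36 baskets contain signs 0 and 9
round(support_10 * n_transactions)     # 31 of them also contain sign 8
```

Expressed as counts, the rule rests on 31 of 36 baskets, which is a useful reminder of how few transactions sit behind some of the high-confidence rules.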

writing rules to csv file

write(association.rules, file="rules.csv", sep=",", quote=TRUE, row.names=FALSE)

Autograding

Note: I had to comment out the autograder call in order to knit; leaving it in produced errors.

#.AutograderMyTotalScore()